🤖 LLM Inference
Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 9,352 posts in 13.1 ms

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
🧠 LLM · arxiv.org · 5d

Introducing dotLLM - Building an LLM Inference Engine in C#
🧠 LLM · kokosa.dev · 12h · Hacker News

amitshekhariitbhu/llm-internals: Learn LLM internals step by step - from tokenization to attention to inference optimization.
🧠 LLM · github.com · 1d · Hacker News

I-DLM: Introspective Diffusion Language Models
🧠 LLM · introspective-diffusion.github.io · 20h · Hacker News, r/LocalLLaMA

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
🧠 LLM · pub.towardsai.net · 5d

Stop benchmarking inference providers, a guide to easy evaluation
🤖 Large Language Models · huggingface.co · 13h · r/LocalLLaMA

Model API Performance
🤖 Large Language Models · news.ycombinator.com · 18h · Hacker News

Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI
💬 LLMs · walsenburgtech.com · 3d · Hacker News

LLM inference, optimized for your Mac
✍️ Prompt Engineering · omlx.ai · 4d · Hacker News

LLM inference engine written ground-up natively in C#/.NET
🧠 LLM · dotllm.dev · 11h · Hacker News

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
💬 LLMs · arxiv.org · 1d

patilyashvardhan2002-byte/lazy-moe: The GPU-free LLM inference engine. Combines lazy expert loading + TurboQuant KV compression to run models that shouldn't fit on your hardware. Built from scratch, fully local, zero cloud.
💾 Bytecode · github.com · 2d · r/LocalLLaMA

Four Reasons Why FPGAs Hit the Sweet Spot for LLM Inference
🏗️ RISC-V · pub.towardsai.net · 13h

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
💬 LLMs · arxiv.org · 2d

A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
🔬 eBPF · arxiv.org · 1d

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
📨 Event-Driven Architecture · arxiv.org · 1d

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
🧠 LLM · arxiv.org · 5d

QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
🎯 Retrieval Systems · arxiv.org · 2d

MEMENTO: Teaching LLMs to Manage Their Own Context
💬 LLMs · arxiv.org · 1d

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
🔬 eBPF · arxiv.org · 6d